Between Flexibility and Consistency: Joint Generation of Captions and Subtitles. (arXiv:2107.06246v1 [cs.CL])
Speech translation (ST) has lately received growing interest for the
generation of subtitles without the need for an intermediate source language
transcription and timing (i.e., captions). However, jointly generating source
captions and target subtitles not only offers potential gains in output
quality when the two decoding processes inform each other, but is
also often required in multilingual scenarios. In this work, we focus on ST
models that generate captions and subtitles consistent in structure and
lexical content. We further introduce new metrics for evaluating subtitling
consistency. Our findings show that joint decoding leads to increased
performance and consistency between the generated captions and subtitles while
still allowing for sufficient flexibility to produce subtitles conforming to
language-specific needs and norms.